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Pure logical thinking cannot yield us any knowledge of the empirical world: 
all knowledge of reality starts from experience and ends in it. 

Albert Einstein, 195 4. 



Introduction 

In the last 20 years the window on our universe has opened to an unprecedented 
level, allowing us to bridge a large fraction of the gap that existed between observa- 
tional fields, like astrophysics and cosmology, and more experimental ones, like, for 
instance, high energy particle physics. A marked difference between observational 
and experimental disciplines which is often emphasized is that the former ones usu- 
ally lack direct control on the object that we wish to study, which is usually not 
directly accessible to us. A consequence of this is that, contrary to experimental 
disciplines (where we can mostly prepare the system under study to suit our needs 
and tackle the data acquisition and analysis under what we judge to be the most 
appropriate and fruitful conditions) in cosmology and astrophysics we usually do 
not have this convenience. The lack of control on the systems under study and on 
their environments can then result in uncertainties which are usually higher than 
the ones obtained in experimental situations and make more challenging to single 
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out discrepancies between models and data. Although this stereotype is true to 
some extent, the situation has now radically changed, as witnessed, for instance, 
by precision measurements of the cosmic microwave background radiation, or by 
millisecond pulsar timing. If it is true that, in general, we have no direct control on 
the phenomena that take place in our universe, it has nevertheless to be recognized 
that they are often probing the laws of physics in such extreme situations, as we 
will never be able to realize in a human controlled experiment. So, while on the 
theoretical side we are still struggling to get a unified picture of fundamental in- 
teractions at the quantum level, it is undoubted that nowadays this understanding 
has to also, if not primarily, face the challenge to be consistent with astrophysical 
and cosmological data, in addition to those provided by human-scale experiments: 
in this sense, a synergic interplay between large and small scale physics is already 
a reality. 

With this higher perspective in mind, we will more modestly present, in this 
contribution, an example of how current observations have the potential to constrain 
emission models for Active Galactic Nuclei. Our emphasis will be on the importance 
of a solid statistical analysis, a precious approach to obtain a quantitative insight 
and guide us in challenging and, hopefully, disproving existing models in favor of 
more refined and realistic ones. 

Our presentation will try to be reasonably self-contained, in the sense that we will 
cover all the required topics in what follows: we will, however, not be able to cover 
them with the required depth, some of which can be obtained starting from the list 
of references. We will start in Sec. [TJ summarizing some key facts about the sources 
that we will consider here, namely active galactic nuclei. Some more details will be 
given for blazars and their emission models in Subsec. 11.11 We will then continue 
with a concise presentation of algorithms for (least square) minimization in Sec. [H 
to finally introduce a solid standard, which is the Levenberg-Marquardt approach 
(Subsec. I2.3[) . The problem of statistical significance is then briefly addressed in 
Sec. [31 where the Kolmogorov-Smirnov test is analyzed (Subsec. 13. 3ft : other methods 
to test how bad a fit iio do exist, but will not be discussed here. Following these 
prerequisites, Sec. @] presents details of a recent analysis in which multi- wavelength 
datasets of the active galactic nucleus Markarian 421 where fitted to a given emission 
model: this section contains three subsections, corresponding to each of the topics 
that are presented in the three background sectionso Subsec. 14.11 describes the 
source and the chosen datasets, Subsec. 14.21 describes the fit algorithm and the fit 
results, Subsec. [4731 describes the statistical significance of the results. 



a The commonly used name for these tests is goodness of fit tests, although, technically, the mean- 
ingful results are those in which these tests fail, showing in this way that the model needs to be 
refined. 

b Numbering has be chosen so that each subsection index matches the number of the section in 
which the corresponding topic is discussed in general. 
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Statistical study of emission versus activity of AGNs 3 

1. Active Galactic Nuclei 

Active galactic Nuclei are galaxies characterized by a core that appears to be 
brighter and more energetic than that of other galaxies. It is, indeed, rather com- 
mon that the core luminosity competes, or even exceeds, the one of the host galaxy: 
in some cases it seems that a luminosity of the order of 10 4 times the one of a stan- 
dard galaxy can be associated with a region with a linear size significantly smaller 
than 1 pc. In addition, non-thermal AGNs emission covers a very broad range of 
the spectrum. 

It is usually believed that the engine at the core of AGNs is a black-hole with a 
total mass of millions or billions of solar masses: such an object is called a supermas- 
sive black hole. Although there is only indirect evidence that this supermassive core 
(SMC) is a black hole, there is usually agreement about the fact that the energy 
necessary to sustain a luminosity as large as the one mentioned above is coming 
from the infall of surrounding matter onto the SMC: this process is generically called 
accretion. The details of matter accretion onto the SMC are still to be understood. 

Another striking feature of AGNs is, in some cases, the presence of two powerful, 
highly collimated jets that shoot out in opposite directions. The way in which these 
jets are formed and other fundamental aspects, for instance related to their com- 
position and to the details of the processes involved in the acceleration of particles 
inside these jets, are also subject of active research. 

From this short, qualitative, introduction it appears that there is still a lot that 
we have to understand about AGNs. Nevertheless, several important steps forward 
have been made and we will concentrate, in what follows, on what we think we do 
actually understand. From an historical perspective, it is important to recall that 
what nowadays we call AGNs comprises a very wide class of extragalactic objects, 
which observationally can (and do) appear very different one from the other. For 
this reason, we will report below a standard classification scheme!^ 

There are several characteristics that can be used to classify AGNs, and their 
common ground is that they are not usually found in standard galaxies. We already 
mentioned one of them, which is the high luminosity. AGNs can reach luminosities 
of the order of 10 48 erg-s -1 , which is 10, 000 times the average luminosity of a galaxy 
(intended as the characteristic luminosity of the field galaxy distribution). This is, 
the apparent luminosity, which is what we see after the radiation emitted has been, 
for instance, absorbed by the circum-object medium: the intrinsic luminosity could 
then be even higher. The energy output is also distinctive, and characterized by a 
broadband continuum emission. Standard galaxies have a spectrum which, at zero-th 
order can be considered the sum of the black body spectra of all the stars compos- 
ing the galaxy: since each star spectrum is approximately a black-body spectrum 
with characteristic temperature that equals the surface temperature of the star, 
and since the surface temperature of stars spans a range of about one decade, then 
a typical galaxy power output is emitted within about one decade of frequency. 
On the contrary, AGNs can have a spectrum ranging from the mid-infrared to the 
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hardest X-rays, and such that a narrower frequency band dominating the emission 
is missing. AGN emission can be moreover characterized by prominent emission 
lines, another element of contrast with most standard galaxies. The broadness of 
AGNs spectral lines can be both, i) very sharp or ii) present broad wings (with vari- 
ability spanning two orders of magnitudes, depending on the source). The emission 
also presents a marked variability at high frequencies (whereas, in the optical band, 
variability is about 10% on human life timescale): there is, nevertheless, a small 
class of AGNs (which will be mainly interested in what follows) having a much 
more marked, i.e. with much shorter timescales, variability. Although the polariza- 
tion of AGNs emission is weak, it can be statistically appreciated when compared 
with that of a standard galaxy: a frequency dependence of polarization is also usu- 
ally detected. Finally, AGNs can also be characterized by a strong radio emission, 
a phenomenology that has been extensively studied since the beginning of radio 
astronomical observations. 

An AGN is a source that presents some (not necessarily all) of the above prop- 
erties. For instance, the vast majority of AGNs have a spectrum characterized by 
a broadband continuum emission, with strong emission lines, some variability and 
weak polarization. Many of them also present a small angular size. Finally, in some 
cases radio emission has been detected and, occasionally, the variability and the 
polarization are both strong. 

Several objects fall within the above definitions and we will see later on that it is 
possible to devise a tentative unification scheme, the so called unified model. We will 
come to it after a concise description of the diverse observational phenomenology. 

A first kind of objects that shows properties typical of AGNs are the so-called 
quasi stellar sources, or QUASARs. QUASARs show two evident relativistic jets, 
but their main characteristic is a very luminous and unresolved nucleus with angular 
size smaller than the arc-second. Radio emission may be also present, in which case 
they are called, radio QUASARs. QUASARs without radio emission are instead 
called radio quiet QUASARs and they are about 20 times more common than radio 
QUASARs. Both radio and radio quiet QUASARs are found at high redshift and 
without signs of a surrounding galaxy. 

A second kind of objects in the AGN class are radio galaxies. Radio galaxies 
are characterized by a radio emission which is thousands times the radio emission 
of a standard galaxy: the radio emission is also apparent in two lobes, that extend 
in opposite directions outside a bright radio core, to which they appear to be con- 
nected by emission jets. In standard galaxies the radio emission can be traced to the 
production of relativistic electrons in supernovae explosions. In radio galaxies, in- 
stead, the radio emission, which is characteristically non-thermal, can be identified 
with synchrotron emission from ultra-relativistic particles. The spectrum, which is 
a broad continuum, is markedly non-thermal also in the optical range, where it 
especially features superimposed emission lines. 

With some properties similar to those of radio quiet QUASARs, Seyfert galaxies 
are also AGNs, but having a lower luminosity. A further classification within Seyfert 
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galaxies can be made based on their spectrum, that can feature i) broad band regions 
as well as narrow lines (these objects are called Seyfert 1 galaxies) or ii) only broad 
lines (these galaxies are named Seyfert 2). Seyfert galaxies are so close to radio quiet 
QUASARs that most of the times it is the way in which they are discovered that 
makes them members of one class or the other: in particular, in presence of a high 
rcdshift and the absence of a surrounding galaxy the objects tends to be classified 
as a QUASAR, while broad emission cores in known galaxies are usually classified 
as Seyfert galaxies. 

Finally we come to the last group of objects in the AGN family, namely blazars. 
Blazars are the most energetic sources that can be observed in the sky. They arc also 
characterized by two powerful jets shooting out in opposite directions at relativistic 
speed: their peculiarity is that we are able to view one of them at a small angle to 
the object axis (which is also the jet emission direction), which means that the jet is 
practically pointed at us. The emission spectrum is extremely broad, from radio to 7 
frequencies and allows an additional distinction: i) when the spectrum is completely 
missing emission lines we have the, so-called, BL Lacertae (BL Lac) objects, which 
are radio loud objects with marked variability and strong optical polarization; ii) 
when, instead, the spectrum also features strong broad emission lines in the optical 
range, we have what are called Optical Violent Variable objects, which in the radio 
band look similar to radio quasars. 

We will discuss in more detail some features of AGNs emission, with particu- 
lar emphasis on blazars, in subsection 11.11 Before moving to this topic, we would 
like, however, remark that despite their apparent difference form the observational 
point of view, a unified model to interpret all these observational features has been 
proposed; according to this unified model, all the different features of AGNs are 
related to the orientation of the source with respect to the observerP All AGNs 
would then be compact supermassive objects and they would be surrounded by 
an accretion disk composed of hot plasma emitting thermal radiation. The broad 
emission lines that are detected in some spectra, mostly in the optical range, would 
be caused by the presence of a toroidal shape region containing dense molecular 
clouds (called the broad emission lines region) : the broad emission lines region can 
obscure the view of the central part. At a larger distance from the central object 
there is also a narrow emission lines region also filled with molecular clouds but of 
density lower than the one present in the broad emission lines region. In a direction 
perpendicular to the plane of the accretion disk and in opposite directions, two jest 
of ultra-relativistic particles emerge from the central object and extend several kilo- 
parsecs away from the core. As we anticipated this model allows for explanation 
of the AGNs phenomenology, once we consider the orientation of the object with 
respect to the line of sight of the observer. Radio galaxies and Seyfert galaxies have 
the jets nearly orthogonal to the line of sight of the observer: the torus surrounding 
the core of the object obscures it and reprocesses the radiation coming from the 
disk and from the broad emission lines region. The lobes define the region where 
the jets loose collimation, and both are responsible for a strong radio emission. A 
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completely different picture of the same model is obtained if we change the orien- 
tation so that the jets now point in the direction of the observer's line of sight: now 
the boosted jet emission pointing directly to the observer is dominating over all the 
other components, and it is the physics of the jet that is mostly detected. At inter- 
mediate angles between the directions in which the line of sight is perpendicular or 
parallel to the jets direction, both the central objects and the surrounding regions 
are seen: these objects are quasars, and their characteristic emission lines are the 
result of the light coming from both the broad and narrow emission line regions. 

We will now provide more details on the emission model, with particular em- 
phasis on blazars. 

1.1. Blazars and emission models 

Among the realizations of AGNs that we have briefly described above a very im- 
portant role is played by blazars. Although only in < 10% of the cases it is possible 
to detect a jet structure in AGNs, a lot of effort has been put in understanding the 
origin of the jets and the physical processes that allow for particle acceleration in 
such highly collimated beams. A full modelling of AGNs and AGNs features forma- 
tion and evolution, is still one of the fundamental open problems in astrophysics; 
however, it is currently accepted that jets consist of low entropy flows associated 
to regions of internal/external shocks, within which the jets dissipate part of their 
energy. Since a relativistic jet that moves in a direction that forms a small angle 
with the line of sight of the observer is greatly amplified by relativistic beaming, 
blazars observations are an exceptional tool to investigate the nature and proper- 
ties of AGNs relativistic jets: indeed blazars' emission is dominated by the jet and 
makes these objects the major extragalactic source of 7-rays, despite the fact that 
they represent only a minority of the AGNs population. Two major subclasses can 
be identified within blazars of the BL Lac type: while all of them show a spectral 
energy distribution (SED) with two pronounced bumps, these bumps peak i) in the 
infrared the first and around GeV frequencies the second or ii) in the X-ray the first 
and around TeV frequencies the second. In particular, BL Lac objects of the first 
kind are called low frequency peaked BL Lac, while BL Lac objects of the second 
kind are called high frequency peaked BL Lac (HBL). 

It is generally agreed that the low energy peak is produced by synchrotron 
emission. Models can differ, instead, in the description of the very high energy 7- 
ray bump, and can be classified in two qualitative different groups: leptonic models 
and hadronic models. 

Leptonic models . These models explain the very high energy 7-ray radiation by 
inverse Compton scattering of photons off relativistic electrons/positrons. 
If the scattered photons are synchrotron photons created by the electrons of 
the jet, the models are called synchrotron self Compton (SSC) models 
If, instead, the scattered photons are ambient photons or photons com- 
ing from the environment, the models are called external inverse Compton 
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(EIC) models. The absence of emission lines in the blazars spectra seems 
to favor SSC models. SSC models can invoke one region in which the rele- 
vant interactions occur (one-zone SSC models), or be refined to take into 
account more than one region. In the following we are going to test a one 
zone SSC model on a set of nine simultaneous multi-wavelength SEDs. 
Hadronic models . In this case the models take advantage from the fact that a 
proton component in the jet would be subjected to less synchrotron losses 
as compared to electrons and could then be accelerated more efficiently.^ 
The low energy peak is again explained by synchrotron radiation, so in 
these models there is an electron components that is accelerated together 
with protons. The higher energy bump is instead produced by the interac- 
tion of the accelerated protons with matter and/or ambient photons and or 
magnetic fields. 7 '"' Proton induced cascades, that can in turn induce elec- 
tromagnetic cascades, and/or proton synchrotron models have also been 
considered! 211 ^ 

It is also not excluded that both processes could coexist in the jet! 16 * 17 ! m w hat fol- 
lows we will be interested in a leptonic model, specifically a one zone SSC modelPSESI 
This model has shown a good agreement with HBL broadband emission in both, 
the ground and excited statesPS Moreover, in one zone SSC models, since there is 
only one population of electrons that generates the doubly peaked emission, there 
is naturally a correlation between the X-ray and the very high energy 7-ray vari- 
ability. The emission zone is assumed to be a spherical blob of radius R, moving 
with a bulk Lorentz factor T in a direction forming an angle 9 with respect to the 
observer viewing direction. Special relativistic effects can then be described by a 
single parameter S = (T(l — /3cos^)) _1 . The model assumes that the spherical blob 
region is uniformly filled by electrons with density n c and by a uniform, tangled, 
magnetic field B. Five more parameters complete the model providing a descrip- 
tion of the relativistic electrons' spectrum. This is characterized by a smoothed, 
broken power- law in 7, the Lorentz factor of the electrons, which is bounded by 
7min < 7max- The transition between the two power- laws takes place at 7b r . The 
energy slope at low and high energies are, respectively, n\ and n-z- Altogether the 
model has nine free parameters: three of them (5, R and B) describe the emitting 
blob; the other six (n\, n.2, 7mm, 7max, 7br, n c ) describe the energy distribution of 
the electrons' plasma. 

It is crucial to realize that only simultaneous, multi-wavelength observations 
allow the determination of all the parameters of the model. In particular, if we 
did not have the very high energy 7-ray observations that became available with 
modern Cherenkov telescopes, we would have only knowledge of the synchrotron 
peak. This would give us information about the electrons' distribution, but not on 
the other parameters of the model, and certainly it would not help us to remove, for 
example, the residual degeneracy between the intensity of the magnetic field and 
the electron density. This simple example already gives an idea of the importance 
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of simultaneous multi-wavelength observations across all the emission spectrum of 
the object. 

To determine the parameters of the model that we just described, we will use 
a rigorous statistical approach, by fitting the SSC model to several simultaneous 
multi- wavelength datasets corresponding to different activity states of the source. In 
this way we will be able to constrain, within some significance level, the SSC model 
in each emission state (this is the primary goal of this contribution) . In turn, this 
allows also an analysis of the behavior of the parameters of the model in different 
emission states, a point which is discussed in detail elsewhere!^ 

Before continuing with the analysis anticipated above, we will introduce in the 
following sections some basic ideas about non-linear fits and their significance. 

2. Nonlinear x 2 fitting 

In this section we discuss a technique that allows to identify local minima of real 
valued functions and that we will use in the context of non-linear least squares 
problems: in particular, our final goal will be to use this technique to minimize 
the sum of the squares of the deviations between a given set O = {(xi, yi, <Ji)\i — 
1, . . . , N} of measured data points (in which <ii is the uncertainty in the measured 
quantity and uncertainties in Xi are considered small enough to be neglected) and 
a model function f(x; p) that depends from a set p = (Pj)j=i,...,n °f n parameters. 
Under suitable assumptions it can be proved^ that the optimal values for the 
parameters according to the observed data can be obtained by minimizing the Chi- 
Square function, i.e. 



When the function / is a linear function of the parameters, a closed formula for the 
minimization of \ 2 can be obtained. Here we are instead interested in the situation 
in which / is a non- linear function of the pj. The minimization process can then 
be performed numerically in several iterations, the goal of each iteration being to 
find a perturbation Spj of the current values of the parameters pj that results in 
a lower value of % 2 ■ Several methods can be developed to find the values of the 
parameters Pj that minimize x 2 . In view of our final goal, we will concentrate on 
two of these methods, namely the steepest descent method and the Gaufi-Newton 
or Inverse Hessian method. 

2.1. The steepest descent method 

The steepest descent methocP^H i s based on the evaluation of the gradient of the 
objective function (the \ 2 m our case) with respect to the parameters pj. The main 
idea of the method is that the most direct path in the direction of a local minimum 
is to descend in the direction opposite to the gradient of \ 2 with respect to the pj . 
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The components of the gradient of \ 2 i i- e - d Pj x 2 turn out to be 

l.jV r 



^* 2 = -£ 



Vi-f(xi;p) ' 

2 d P] f{x l \V) 



For later convenience we will set up a more compact matrix notation for the above 
relation, which is 

(grad pX 2 (p)) T = -(y-f) T SJ ! 
where the gradient is nothing but 

grad p x 2 (p) = {d Pj x 2 )j=i,...,n, 
y is the iV-dimensional constant vector of the dependent variable observed data, 

y = (yi)i=i,...,N, 

f is the N dimensional vector of the dependent variable values estimated according 
to the model described by f(x, p), 



f = f(p) = (/(z i ;p)) < =i,. 



,A<- 



£ is the N x N diagonal matrix of the weights corresponding to the dependent 
variable uncertainties in the N measurements 0, 

£ = diag(a^ 2 ) i= i i ... tN 

and, finalljQ, J is the Jacobean matrix of f , i.e., 

J = (9 P3 .f(p))j=i,...,„ = (d p J(x l -p)) i=i n . 

j=l,...,n 

A perturbation Sp = (Spj)j=i,...,n that updates the parameters in the direction of 
the steepest descent, i.e. in the direction opposite to the gradient of x 2 > can then 
be obtained as 

5p = /iJ T S(y-f), 

where we used the fact that, since S is diagonal, S T = S. is a positive real 
number that determines the length of the step in the steepest descent direction. 

Based on the above framework, the steepest descent method consists of a se- 
quence of parameters updates that are always performed in the direction of the 
steepest descent until a minimum is found with the prescribed accuracy. For simple 



In this notation the \ function can be written as 

X 2 (P) = i(y-f) T S(y-f) 



iy T Sy + y T Sf +if T Sf. (1) 
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objective functions the steepest descent method is recognized as a highly conver- 
gent approach to minimization and, when the number of parameters is high or very 
high, it can be considered as the most reliable, if not the only viable, method. There 
are, nevertheless, weak points of this method, especially for complex models: these 
weak points are substantially related to the fact that the method does not take into 
account the curvature of the surface which, in our case, is the graphic of the x 2 ■ 
Because of this, it is possible to make too large steps in steep regions or, too small 
steps in shallow regions: this clearly affects the convergence of the algorithm. At the 
same time, particular structures in the x 2 surface, as for instance narrow valleys, 
may also damage convergence. In the case of a narrow valley, for instance, we would 
need to move a large step in the direction that points along the flat base of the 
valley, but only a small one in the direction perpendicular to the valley walls. If it 
is true that second order information, i.e. the use of curvature information about 
the x 2 surface, would definitely help to improve the method, it is often (and as we 
will see later on, in our case in particular) computationally expensive to access this 
second order information. We will see that a good compromise can be found: to this 
end, we need to first analyze another approach to minimization, which we will do 
in the following subsection. 

2.2. The inverse Hessian method 

Another approach that can be used to determine a minimum (in particular of the x 2 
function described above) is the inverse Hessian (or Gauss-Newton) methodE). To 
give a sound motivation to this approac h 124 ! 25 ! let us first consider a particular case, 
i.e. the one in which the dependence of the model function from the parameters is 
linear. The model function can then be written as 

l.n 

f(x,p) = /«»(*) + (L^f p d ^ f(°\x)+J2Lf\x) Pj , 

j 

where L^(x) is the vector h^(x) = (Lj (x))j=i,...,n- F° r this linear model 
f=(/( )(x l ) + (L«(x. i )) T p) 

\ /»=1,...,JV 

and 

J = ((L«(^)) T ) i=1) ...^ 

so that 

f = f(°)+J P , f<°> ^ (/Wfa))^!,...,*- 

The x 2 is then a quadratic function of the parameters, 

X 2 (P) = \{y - f (0) ) T S(y - f<°>) - (y - f(°)) T EJp + ip^SJp 



We will use the first denomination here, although the second is also quite widespread. 
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and the minimum can be obtained in closed form by algebraically solving for p the 
linear equation 

-(y-f(°)) T SJ + p T H = 0, 

which was obtained remembering that, since £ is diagonal, H d =' J T SJ is an n x n 
symmetric matrix. The final result is the minimum point p m i n , 

p min . = H- 1 J T S(y - f(°>) = H- 1 J T S(y f) + p. 

In this special case (the model function is linear in the parameters and, thus, the 
X 2 is quadratic in p) we have that i) H is exactly the Hessian of the x 2 , ii) H it is 
a constant matrix (specifically, constant with respect to p) and iii) it is possible to 
write in closed form an exact solution for the minimum. 

The inverse Hessian method takes advantage of the above result, by using it 
to deal with the general case, in which the model function is generic. The way 
in which this is obtained is by iterating successive steps: in each of them a linear 
approximation of the model function around the current values of the parameters 
is used, which results in a quadratic approximation to the \ 2 . The exact solution to 
this linearized model is given by the equation derived just above and will provide 
us with a new value of the model parameters; of course, in the fully non-linear case 
these new values will unlikely realize the x 2 minimum, but they might turn out 
to be a much better approximation to it. By successive steps convergence to the 
minimum might eventually be achieved. The advantage of this method, as opposed 
to the steepest descent method described in the previous subsection, is that we are 
here using information related to the curvature of the x 2 surface that might allow 
us to reach more quickly the sought minimum. At the same time, if we are not close 
enough to the minimum, the linearized model will probably not be a good enough 
approximation of the fully non-linear model. Several steps might be required to 
arrive close enough to the minimum, where the method is particularly efficient. 

2.3. Levenberg and Levenberg-Marquardt methods 

As we have discussed in the previous subsections, both the steepest descent method 
and the inverse Hessian method have advantages and disadvantages. The first of 
the two, is iteratively trying to converge toward the minimum by updating the 
parameters as 

p^p + (5p, (5p = M J T £(y-f), M eK+; 

good convergence could be badly affected in situations in which the shape of the 
X 2 surface presents features that require an estimation of quantities related to the 
surface curvature, which could be the case in proximity of the minimum. The second 
method, was, also iteratively, trying to converge to the minimum by updating the 
parameters as 

p^p + 5p, ^p = H- 1 J T S(y-f); 
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good convergence is more likely achieved, in this case, close to the minimum, when a 
linear approximation to the model function is more appropriate, and the % 2 surface, 
locally, could be well approximated by a paraboloid. 

A vantage element, common to both models, is that they only require the cal- 
culation of the first derivatives of the model function (out of which J is made): 
no second derivatives appear in these methods, which would allow to spare com- 
putational time. Also, from the qualitative analysis that we have done of the two 
methods, they appear to be more effective in complementary situations, the steepest 
descent being likely efficient far away from the minimum, while the inverse Hessian 
could provide better convergence close to it. These considerations strongly motivate 
Levenberg proposalpSl which is to iteratively update the parameters according to 
the following rule: 

p^p + rlp, 5p= (H + AI)- 1 J T S(y-f), A e K+, 

where I is the n x n identity matrix. The positive real number A is fixed at a small 
value at the beginning of the computation: it is, then, dynamically adjusted by the 
algorithm according to the estimated distance from the expected minimum. When 
the algorithm estimates to be far from the minimum, A is progressively increased so 
that the contribution of H to H + AI becomes negligible and the method behaves as 
the steepest descent one. On the contrary, when the algorithm estimates to be closer 
to the expected minimum, the value of the parameter A is progressively reduced, 
until it will be so close to zero that the contribution of AI to H + AI will be negligible 
and, in all respect, the algorithm will be following an inverse Hessian approach. 
The quantity that is computed to decide the decrease/increase in A is the absolute 
difference between the value of % 2 and the value of its quadratic approximation, 
with the assumption that this approximation becomes better and better close to 
the minimum. 

Following Levenberg, Marquardt proposed a further refinement of the 
modelp3l23 hence the name of what can nowadays be considered a robust stan- 
dard for x 2 minimization, i.e. the Levenberg-Marquardt method. The proposal of 
Marquardt was to use, also during the steps closer to the steepest descent method, 
part of the information about the curvature of the % 2 surface encoded into H (we re- 
mark again that H is not the Hessian of the \ 2 but it is the Hessian of the quadratic 
approximation to the x 2 which is obtained by linearizing the model function) . The 
update rule for the Levenberg-Marquardt algorithm is, therefore, 

p^p + (5p, Sp= (H + Adiag(H))- 1 J T S(y-f), A e M+. (2) 

It is this method that we will apply in the study discussed in section |4] 

3. Statistical Significance 

In this section we would like to discuss some selected topics about what are usually 
called goodness of fit tests. In particular we will concentrate our attention on a 
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particular test, the Kolmogorov-Smirnov (KS) testP^H ^he KS test can be used 
to decide if a data sample is coming from a population with a specific distribution, 
and will be thoroughly discussed in section 13.31 As an introduction to this more 
specific topic, we will first recall the general problem. Then, in the next subsection, 
we will describe a standard approach that can be used with binned data. This will 
give us the possibility to appreciate some crucial difference (and advantages) of the 
KS test, which will be presented in the last part of this section. 

3.1. The general problem. 

Let us consider the case in which we are given two sets of data, say and : 
we may be interested in quantifying our certainty about the fact that the two sets 
of data are coming from populations having the same distribution function. To be 
more precise, let us consider the following statement, i.e. the null hypothesis JYq, 

JVq : THE TWO SETS OF DATA 0^> AND ARE COMING FROM THE SAME POP- 

ULATION DISTRIBUTION FUNCTION. 

We are interested in methods that allow us to disprove ,jVq to a certain level of 
confidence; if we can succeed in disproving jYq^ then we will conclude that and 
0( 2 ) are coming from different distributions. We remark that disproving jVq to a 
certain level of confidence is as far as we can go from the statistical perspective and 
that failure in disproving JVq only shows that at the given level of confidence it is 
consistent to consider the two sets of data as coming from the same distribution. 
In the general statement that we are discussing we did not make any assumption 
about and that, in general, can be coming from two different unknown 
distributions. Later on, we will be interested in a particular case, namely the one 
in which one of the two distributions is known. In this case, the null hypothesis will 
be 

jVq : THE SET OF DATA 0^> IS COMING FROM A POPULATION DISTRIBUTED AS V . 

3.2. The Chi-Square test 

The first approach that we will describe is an accepted standard to solve the above 
problem for binned data!22122Let us then consider a binning of the sets of data 0^ a \ 
a = 1,2, in N\> bins indexed by a set of integers i £ I, such data nf\ a = 1,2, is 
the number of observed points of the 0( Q ) data falling in the i th -bin (the binning 
intervals are the same for both sets of measurements). We can then construct the 
following estimator 

X 2 [0\0>;N h ] = £ (r \ " aft } , (3) 

where / = {j G I \ n^p — — 0} is used to exclude from the sum bins for which 
the corresponding term would not be well defined and the = 1,2, are defined 
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according to the following relations: 

j V (-) = 2 n (-) =#0 (-) i r W = (y , a = 1,2. 
iei ^ ' 

To consider the correct x 2 statistics, we also need the number of degrees of freedom, 
v, that can be associated to the test. This number is v = N\, — N c , where N^, from 
above, is the number of bins, and A c is the number of independent constraints that 
have been imposed on the sets of data. 

A similar test can be applied to the case in which we have only one set of data, 
say and we would like to disprove if is coming from a given distribution 

V. As before let us consider a binning of data in N\> bins; again nj 1 ^ will be 
the number of data falling in the i th -bin. We will, moreover, define Si as the 
number of data expected in the i th -bin if the data were distributed according to T> 
(of course, Si does not need to be an integer). If I is the set of integers indexing the 
binning and I = {j S I \ n'p = dj = 0}, we can define the following estimator 

X 2 [0«;P;A b ] = J2 ( "' (1) ~* )a - (4) 
iei\T 

We remark that, in this case, there are some potentially not well-defined terms in 
the sum, i.e. terms corresponding to bins in which <5j = and n| = 0. These 
terms mean that, according to the distribution T>, there are no results expected 
in the given bin, whereas the observed data do have occurrences in the bin: this 
is a simple case in which it is disproved that the data can be obtained from the 
distribution T>. As in the former case the number of degrees of freedom is needed to 
have the correct % 2 statistics. If -/V p is the number of parameters required to know 
the distribution V that have been determined from the data, and if the occurrences 
in each bin that are expected from the model, Si, are fixed (and not renormalizcd 
to match the total number of observed points) then v = — N p . If, instead, the 
constraint J^i&i 71 ^ = J2iei &i i s imposed, then v = Ny, — N p — 1. Additional 
independent constraints that should be present, decrease accordingly the number 
of degrees of freedom. 

Wether we are working in the first framework, with two sets of data, or in 
the second, with only one set of data and a comparison distribution, we end up 
with an estimator {x 2 [O x , O 2 ; Ny>] or x 2 [0W; V; N^] respectively) and an associated 
number of degrees of freedom v. In what follows we will write briefly x 2 [^Vb], since 
our considerations can be applied in the same way to both cases and we wish to 
explicitly emphasize the dependence from the binning that we had to perform. 
In both definitions of x 2 [Ab] the terms in the sums are not individually normal. 
However in the limit in which the number of bins, Ab, is large enough, or the 
number of events in each bin is large enough, it is standard practice to consider 
the above defined x 2 [Ab] as the sum of the squares of v normal random variables 
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of unit variance and zero mearQ. The Chi-square probability function Q(x 2 \v), i- e - 
the probability that the sum of the squares of v random normal variables with unit 
variance and zero mean is greater than \ 2 , can be used with X 2 [-^t>] to test our null 
hypothesis. Q(x 2 W) i s defined as 

T(u/2) J x 2 /2 

and it is tabulated for convenience in statistics textbooks. 

The Chi-Square method we just recalled works with binned data; although it 
is always possible to obtain binned data from continuous data, there is often a 
great deal of arbitrariness in the binning process and it is likely that the outcome 
of the test will depend on the binning (a fact which we already emphasized in 
our notation by writing x 2 (-/V D ) above). At the same time the Chi-square method 
makes an assumption about the normality of the data: although this assumption is 
at the background of many statistical results, it might not be always satisfied and 
it would be desirable to obtain ways to test the null hypothesis without relying on 
this assumption. This turns out to be possible for continuous distribution, and an 
accepted standard is the Kolmogorov-Smirnov (KS) test, which is discussed in the 
next subsection. 



3.3. The Kolmogorov-Smirnov test 

As we anticipated the KS test can be used for continuous distributions. It avoids 
binning, it does not assume normality, and it is based on the concept of empirical 
cumulative distribution function which we will soon introduce. We will start our 
analysis by considering the case of ,yV ', assuming that the results in are the 
results obtained by sampling N = independent identically distributed random 
variables Xi, i — 1, . . . , N, distributed according to some unknown distribution V. 
We will denote by P(x) the cumulative distribution function associated with V, i.e. 
P(x) d =' Prob(A"i < x). An empirical cumulative distribution function is a way to 
count how many of the observed points can be found below the value x, and it is 
defined as 

1 1,JV 

P N(x) = -Y,%X i <xl 

i 

where — 1 if ^ is true and I(^) = otherwise. Pn(x) counts the propor- 
tion of £>W that can be found below x in steps of 1 /N. By the law of large numbers 
it can be seen that the proportion of that can be found below x tends to the 



e This is the reason why the denominator of is twice the average of n t and : indeed, since 
the variance of the difference of two normal variables is the sum of the two individual variances, 
twice of their average is what is required to obtain for each term of the sum a random variable 
with unit variance. 
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cumulative distribution function P{x) 1 i.e. 

Pn{x) — > P{%) in probability. 
We will now prove a first resultPm 

Proposition. // P(x) is continuous then the distribution of 



sup\P N (x) - P(x)\ (5) 



does not depend on P. 
Proof. 



We are interested in the behavior of the distribution of ([5]), i.e. 

Prob ( sup \P N (x) - P(x) \<t). (6) 



Let us define the function P 1 (z) d =' min{a;|P(a;) > z}, where < z < 1 because this is 
the range of P, and preliminarily calculate 

l,N l,N 

p N (x) = p n (p-\ z )) = ^E J (^ ^ p_1 (*)) = jf Ew^ ^ z )- ( 7 ) 

i i 

The above expression, contains P(Xj), which is a uniform distribution on the interval 
[0, 1], because the cumulative distribution function F(Xi) is given by 

Prob(P(Xi) < t) = Prob(Xi < P^{t)) = P(P _1 (t)) = t. 

This implies that the random variables Yi = P(Xi), i = 1,...,N, are independent and 
uniformly distributed on [0, 1], so that we can continue the chain of equalities in ((7|) to 
obtain 

l,N 1,N 
Pn(x) = PatIP- 1 (z)) = ... = !£; I(P(X,) <2) = -E I« < *), (8) 

in which the last term is independent from P. 

The proof can now be quickly completed by performing a change of variable in ([6| and 
then using the result in (JSj ; we get 

Prob(sup \Pn{x) - P{x)\ <t). = Prob [ sup \P N (P^ 1 (z)) - P(P _1 (0))| < t 



< t 



xSR \0<z<l 

1,JV 



= Prob I sup 

\ 0<z<l 

which shows that © is independent from P, Q.E.D. 



The above results imply that uniformly over K we have 



sup I P w (2;) -P{x)\ — > 
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(i.e. the largest difference between P/v and P goes to zero in probability) and that the 
distribution of the above supremum does not depend on the unknown distribution of 
the sample O^, i.e. P, whenever P is continuous. The final step that motivates the 
KS test follows from the observation that, given x, the central limit theorem implies 
that VN(P N (x) - P(x)) converges in distribution to a normal distribution with 
zero mean and variance P(x)(l — P(x)) (because this is the variance of I(Xl < x)). 
Moreover, y/~N sup xGR \Pn(x) — P(x)\ also converges in distribution, as shown by 
the following proposition. 

Proposition. The cumulative distribution function of \/N sup xg g \Pn(x) — P(x)\ 
is such that 

Prob I ViVsup \P N (x) - P(x)\ <*)—>■ P KS (t), (9) 

where PksW = 1 — 2 l) fc_1 exp(— 2fc 2 t) is the cumulative distribution func- 

tion of the Kolmogorov-Smirnov distribution. 

Proof. 



The proof of this result will not be given hereP^EIl 



The net result of the above analysis is that if Pq is the cumulative distribution 
function of the distribution associated with jVq , then we can consider the statistics 

D N = VN sup \P N (x) - P {x)\, (10) 

which will depend only on N and can then be tabulated. 28 29 If N is big enough the 
distribution of Dn is approximated by the Kolmogorov-Smirnov distribution. We 
will now consider what happens if JV^ fails, which means that P ^ Pq: in this case 
the empirical cumulative distribution function will converge to P, it will then not 
approximate Pq and for large N we will have sup^gg \Pn(x) — P (x)\ > e for some 
small enough, positive e, i.e. 

D N = ^ sup\P N (x) - P (x)\ > VNe. 

This will allow to define a decision rule in the form < c, where the constant c is 
defined by the significance level and the decision rule can be verified by tabulated 
values for Dpf. 

The Kolmogorov-Smirnov test can also be applied to where we are inter- 
ested in understanding if the two sets of data and are coming from the 
same distribution. Let {^"' l } J : = i,... l Ara be the sample of having cumulative 
distribution function P^ a \ a = 1,2. If P&L), cc = 1,2, are the corresponding em- 
pirical cumulative distribution functions, then the two propositions above are also 
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satisfied by the following statistics, 



N (l) N (2) X 1 / 2 



p(l) _ p(2) 

JVd) JV< 2 > 



to which a similar decision rule as the above can be applied. 
3.4. Statistical Significance of x 2 fits 

In the following we will be interested in determining the statistical significance of 
fitting observed data points to a highly non-linear model function. In addition, the 
model function will depend from several parameters, and, although at the present 
level of knowledge these parameters are considered independent, the complexity 
of the physical situation makes it possible that, in more refined models, some of 
them might show correlations. In addition to the complex structure of the models, 
there is the fact that the data points will be spread across several decades (range 
of the independent variable) and will come from different instruments based, not 
only on different hardware/software, but also on different physical processes for 
their operation. It is important under these circumstances to have a check on the 
goodness of fit. Because of the complexity of the situation, the approach that we 
propose is to perform a Kolmogorov-Smirnov test on the normality of the residuals 
obtained after the fitting procedure. In this case our null hypothesis will be 

yy Q " : THE RESIDUALS OF THE NON-LINEAR FIT OF MULTI- WAVELENGTH DATA TO 
THE SPECTRAL ENERGY DISTRIBUTION FUNCTION OBTAINED BY IMPLE- 
MENTING A GIVEN SYNCHROTRON SELF COMPTON MODEL ARE NORMALLY 
DISTRIBUTED. 

As a final remark, we remember that there are situations in which the critical 
values of the test statistics can be difficult to calculate: these include situations 
in which the samples are of small size and/or the parameters of the distribution 
are estimated with the same data that are being tested. For the second case a 
convenient solution is the inclusion of a correction factor, but, in general, the safest 
way to procede is to use Monte Carlo methods, for instance to generate, under the 
fitted distribution, datasets of the same length as the tested one, and use these them 
to obtain the correct critical values. 



4. An application: Markarian 421 

As a field test application of what we have seen in the previous sections, we will 
now apply the models and techniques introduced above to a concrete case, i.e. the 
study of the emission properties of the AGN Markarian 421, following a recent 
publication. 21 This section will be divided in subsections. In each of them we will 
discuss one of the aspects that we have introduced in the sections above and there 
will be a direct correlation with the subsection number and the section number in 
which the concepts exemplified in each subsection have been discussed in general. 
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4.1. The source 

Markarian 421 (Mrk421) is the closest blazar (at a redshift z — 0.030) and the first 
extragalactic 7-ray source with emission in the TeV range detected by Imaging Air 
Cherenkov Telescopes! 35 * 36 ^ It is, nowadays, the most well known blazar, together 
with the, also close one, Markarian 501, and falls within the class of HBL objects. 
Mrk 421 is a source that shows remarkable variability, both in flux variations (that 
were observed to change by almost two orders of magnitude) and time development 
(flux doubling was detected on a time scale of the order of 15 minutes^). 

For our purpose Mrk421 is an excellent source, which has been extensively stud- 
ied across over 19 decades in energy. In particular: 

(1) the SSC emission dominates the detected spectrum, with correlated low-high 
energy fluctuations; 

(2) the Compton peak is in the range where Cerenkov observations are effective; 

(3) the spectrum can be described as a single power-law. 

For the above reasons, the one-zone SSC model that we have described in the 
previous section is a good candidate to describe the SED of this source. At the 
same time simultaneous multi-wavelength observations are available, and, among 
them, it was possible to identify nine, good to very good quality, spectral energy 
distribution sets of simultaneous data corresponding to different emissions states. 
The detailed description of the datasets is reported below. 

state 1 and state 2 . The first two datasets that we consider correspond to multi- 
wavelength data obtained from campaigns triggered by a major outburst of 
Mrk421 that was detected by the 10m Whipple telescope in April 2006P1 It 
was unfortunately impossible to promptly set-up a multi-wavelength campaign 
because of some visibility constraints on XMM-Newton; for this reason simul- 
taneous observations at different wavelengths were taken during the decaying 
phase of the flare. The optical monitor of XMM-Newton was used to collect 
optical and ultraviolet data, whereas the EPIC-pn detector of the same tele- 
scope provided X-ray observations. Very high energy 7-ray data were collected 
by MAGIC and Whipple telescopes. Altogether this resulted in more than 7 
hours of simultaneous observations, about 4 hours of which form the datasets 
for state 1 and more than 3 hours the dataset for state 2. 

state 3 . A third dataset contains multi-wavelength observations initiated after a 
detection by the all-sky monitor of the Rossi X-ray Timing Explorer and by 
the 10m Whipple telescope! 3 ^ The observation campaign continued during De- 
cember 2002 and January 2003. The very high energy data was obtained by 
Whipple between December 4, 2002 and January 15, 2003 and by the High En- 
ergy Gamma Ray Astronomy CT1 between November 3, 2002 and December 
12, 2003. However, since our analysis is centered on simultaneous observations, 
we have used only the very high energy data taken at nights during which simul- 
taneous X-ray observations were available. Optical information consists of the 
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average flux obtained from the Boltwood observatory optical, KVA and WIYN 
telescopes. 

state 4 and state 9 . Two more datasets are the result of a longer campaign un- 
dertaken during 2003 and 2004P The Rossi X -ray Timing Explorer was used 
to collect the X-ray flux, that was then classified into three sets, having low-, 
medium- and high-flux, respectively. X-ray observations where then comple- 
mented by Whipple very high energy 7-ray data, taken within one hour of 
the selected X-ray ones. Whipple Observatory 1.2m telescope and Boltwood 
Observatory 0.4m telescope provided optical data following the same grouping 
method: although it was not possible to get optical data simultaneously with 
the remaining multi-wavelength data, the fact that minor variations in the op- 
tical flux were detected, allowed to consider its maximum and minimum values 
as reliable approximations. In this way it was possible to obtain a medium flux 
dataset, corresponding to state 4, between March 8 and May 3, 2003 and a high 
flux dataset, corresponding to state 9, between April 16 and 20, 2004. 

state 5 and state 7 . Another two datasets are the result of a multi- wavelength 
campaign that took place between March 18 and March 25, 2001 X-ray data 
are the results of Rossi X-ray Timing Explorer observations, whereas the very 
high energy 7-ray flux was obtained by the Whipple telescope. State 5 corre- 
sponds to a post-flare state during March 22 and 23, for which optical informa- 
tion corresponds to the lowest flux detected by the 1.2m Harvard-Smithsonian 
telescope on Mt. Hopkins. State 7 is, instead, the high-flux peak of March 19, 
also complemented by optical data from the same instrument as for state 5, but 
using the highest optical flux. 

state 6 . This dataset were taken after an outburst in May 2008 and contains the 
results of about 21/2 hours of simultaneous observations As for state 1 and 
state 2 the optical monitor of XMM-Newton was used to collect optical and 
ultraviolet data, whereas the EPIC-pn detector of the same telescope provided 
X-ray observations. The very high energy 7-ray flux was in this case obtained 
by VERITAS. 

state 8 . The last dataset is the result of a dedicated multi-wavelength campaign 
on June 6, 2008.^ Optical data was obtained by WEBT, whereas X-ray obser- 
vations were made by the Rossi X-ray Timing Explorer and by Swift /BAT and, 
finally, very high energy 7-ray fluxes were taken by VERITAS. 

All sets of data present marked qualitative difference between the optical to X-ray 
and the very high energy 7-ray ranges: the most striking one is the uncertainties 
in the measured flux, which is very small when not negligible at low energies, and 
much sizeable at very high energies. This observation will play a role in our future 
discussion on the statistical significance of the fit results. In what follows we will, 
first, show how to fit each of these datasets to the chosen SSC model and how to test 
the goodness of the fit. An in depth interpretation of the physical conclusions about 
the source is beyond the scope of this lecture and can be found in the literature!^ 
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4.2. Nonlinear x 2 fit 

The fit algorithm will be an implementation of the Levenberg-Marquardt method 
discussed in subsection 12.31 As this is a standard approach in nonlinear minimiza- 
tion, it is possible to conveniently find code which is optimized and efficient for 
general problems. In particular we have used as a starting point the mrqmin func- 
tion discussed in Ref. 1241 This subroutines executes a single minimization step, 
which is an update step on the parameters as defined in ([2]) . Implementation details 



START 

assign Pq, initial point 
in SSC parameter space, 
and Axni; set cni = 



2 <Po) 



compute x 

xl = x 2 {P») 



output values of x 2 , 
SSC parameters and 
associated uncertainties 
END 



cni = 



determine P, next point 
in SSC parameter space 



xl = 


x\P) 


increase 


weight of 


inverse 


Hessian 



NO 



NO 



.1 YES 

| CNI < 4 k 



increase weight of 
steepest descent * j 



X 2 {P)>xl 



] compute X 2 (P) \ 



x 2 (P) < xl 



iX 2 (P)-xl<A X l 1 



YES 



cni = cni + 1 



Fig. 1. Flow chart of the minimization algorithm (from Ref. I21|l . A careful choice of Axj^j an< f 
of the exit value for cni is necessary to obtain consistent results avoiding, at the same time, 
unnecessary computational cost. The optimal value for the exit condition on cni is used in the 
flow-chart. Axjjjj as small as 10 — 4 was required, especially for sets containing larger number of 
data points. 



and additional functions called by mrqmin are described in in Ref. [24] and will not 
be discussed here, except for the case of the code that is used to evaluate the SED, 
which we will analyze in more detail in a moment. The single minimization step 
performed by mrqmin (or by any other implementation of the Levenberg-Marquardt 
method) has to be iterated until convergence to the sought minimum is considered 
satisfactory. The flow-chart of the code that we used is reported in Fig. [TJ After 
fixing some trial values for the parameters of the model (Pq point in parameter 
space) and initializing to zero an integer variable, cni that will count how many 
consecutive individual minimization steps have been performed with a negligible 
improvement in decreasing the x 2 , we calculate Xcu i- e - the value of x 2 at the initial 
point Po in parameter space. Minimization iterations then start: at the beginning 
we fix the parameter corresponding to A in (|2[l so that we are performing a step 
in which the steepest descent contribution to the algorithm is dominant. The new 
value of x 2 at the new point in parameter space P determined by the algorithm, 
X 2 (P), is calculated. If the x 2 did decrease in the step, we check if it decreased by 
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a sizeable amount. If not we increment by one the Cni counter, before increasing 
the weight of the inverse Hessian method and moving to the next iteration. If, in- 
stead, the x 2 increased at P, we increase the importance of the steepest descent 
method and reset the counter cni to zero. The exit condition from the loop is satis- 
fied when cni = 4. Another crucial parameter that has to be fixed is the threshold, 
below which we consider negligible the decrease in \ 2 (this is called A% NI in the 
flow-chart). It is usually considered that it is unnecessary to determine to machine 
accuracy or to roundoff limit the minimum of the % 2 , as the result provides only 
a statistical estimation of the parameters anyway. However, in our experience, it is 
important to pay particular attention in setting the exit condition value of Cni and 
the negligible \ 2 increment, Ax^ P Our experience has shown that the best results 
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Fig. 2. Snapshots of different steps in one run of the minimization code. The step numbering 
is conventional and there is no rigorous correspondence between algorithm iteration number and 
step number in the figure above, except that a higher step number corresponds to an iteration 
that follows the iterations associated to lower- numbered steps (from Ref. |2U . 
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were obtained witt@ Cni = 4 and Ax^i = 10~ 4 . As a further test of the minimiza- 
tion algorithm, we implemented an additional step: after obtaining the minimum we 
perform a random change of the parameters and repeat the minimization from this 
new random point in parameter space. This could help to identify cases in which 
minimization remained stuck in a local minimum different from the absolute one, 
but we never faced this situation, i.e. the additional minimization step did always 
converge to the result of the first one. When we choose smaller/larger values for 
cni/Ax^jj (for instance, 2 and 0.01) the minimization occasionally provided a result 
which, on closer inspection, turned out to be a not good approximation to the % 2 
minimum^. This tendency was extremely more marked in presence of datasets hav- 
ing a larger or much larger number of data points: in our experience convergence is 
usually slower in these cases: a tentative visual representation of some steps in the 
minimization process is given in Fig. [2] (for details, please see the related caption). 

From the point of view of the computation time the Levenberg-Marquardt al- 
gorithm is quite efficient)^ and most of the time was required by the derivation of 
the SEDs corresponding to the current values of the parameters at each iteration. 
In our analysis, following a common assumption in the literature, we have fixed 
7mm = 1; the redshift of the source is known, so high energy data points can be 
corrected to take into account interaction with the extragalactic background light. 
We are, then, left with eight parameters to be determined by the fit. According to 
((2]) at each step we need: 

y : this is just the vector of the observed flux values, which is clearly known; 
S : this is the vector of the flux uncertainties, also known; 

f : this is the vector of the values of the model SED evaluated at the observed 
energy / frequency ; 

J : this is a matrix which is known once the derivatives of the model SED are known; 
H : this matrix can be calculated knowing S and J. 

It is clear that y, S and H do not represent a problem, f and J in standard 
cases, where the model function has a known analytical expression, also do not. 
In our case (and in several others), however, the model SED is not analytically 
known and what we have is just a discrctized sample resulting from a numerical 
implementation of the SSC model; it is this last numerical implementation that can 
be more computationally intensive. This is especially true since the estimation of J 
requires the SEDs partial derivatives with respect to the parameters. Then, for each 
minimization iteration, we have in principle to evaluate a number of SEDs equal to 



The values of the two parameters may be correlated, so other choices may work as well. 
g Meaning that repeating the minimization with the optimal values for the parameters resulted in 
an appreciable different and lower \ 2 (and in more consistent values for the obtained uncertainties 
on the parameters, a point which we will discuss in more detail later on). 

h As an example, the longest minimizations required about 10 2 iterations that on a less than 
average PC (with a Intel® Core™2 Duo CPU (E7500, 2.93GHz) running a x86.64 GNU/Linux 
with 2GB Ram) took about 20 minutes to run. 
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Fig. 3. Plot of the SEDs obtained for each state with the x 2 minimization procedure described 
in the text (from Ref, |2H . Results for the parameters and associated uncertainties, together with 
the values of the reduced \ 2 m each case, are reported in Tables [l] and [2] 



twice the number of varying parameters plus one. In our implementation we tried to 
reduce the load caused by this task by developing a bookkeeping mechanism that 
caches some of the numerically sampled SEDs used in the previous steps: this is 
especially useful when the iteration results in a x 2 increase, since in this case, when 
returning to the previous point in parameter space, the required SEDs are already 
available. Caching optimizes the number of times at which the x 2 minimization 
executable needs to stop and wait for the completion of an external module that, 
independently, executes the SEDs evaluation. 

A first study obtained applying the x 2 minimization algorithm on the nine 
datasets described in Subsec. I4.ll results in the SEDs plotted in Fig. [3] and in the 
values of the parameters and related uncertainties listed in Tables Q] and [2] We 
are not interested here in a detailed report of the physical conclusions that can be 
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Table 1. Best-fit single-zone SSC model parameters for the nine datasets of 
Mrk 421. States are labelled according to the convention in Fig, l3l (from Ref. I21|l. 
Here parameters describing the emitting blob together with the reduced x 2 are 
reported. Results for the remaining parameters can be found in Table [2] 



Source 


B 


R 


5 


x 2 




[gauss] 


[cm] 






State 1 


(9 ±3) x KT 1 


(9 ± 4) X 10 14 


(2.0 ± 0.5) x 10 1 


0.84 


State 2 


(8 ±6) x lcr 1 


(8 ± 4) x 10 14 


(2.7 ± 1.1) x 10 1 


1.86 


State 3 


(6 ±6) x icr 2 


(2.0 ± 1.5) x 10 15 


(1.0 ±0.5) x 10 2 


0.91 


State 4 


(1.21 ±0.16) x icr 1 


(1.1 ± 1.3) x 10 15 


(8 ±6) X 10 1 


0.89 


State 5 


(1.9 ±1.3) x icr 1 


(10 ±4) x 10 14 


(7 ±5) x 10 1 


0.67 


State 6 


1.0 ±0.7 


(6 ± 3) x 10 14 


(2.8 ± 1.1) x 10 1 


1.39 


State 7 


(4 ±3) x KT 2 


(2 ± 5) x 10 15 


(8 ± 7) x 10 1 


1.61 


State 8 


(6 ±3) x 10" 2 


(2.0 ± 1.8) x 10 15 


(1.1 ± 0.4) x 10 2 


0.60 


State 9 


(4 ±3) x 10" 2 


(2 ±4) x 10 15 


(1.2 ± 1.0) x 10 2 


0.85 



Table 2. Best-fit single-zone SSC model parameters for the nine datasets of Mrk 421. States are 
labelled according to the convention in Fig. \3\ (from Ref. 21). This table lists the parameters that 
describe the energy distribution of the electron plasma. The parameters describing the emitting blob, 
together with the reduced x 2 obtained in the fit, can be found in Tabic [T] 



Source 




Tbr 


7max 


ni 


112 




[cm -3 ] 










State 1 


(1.3 ±1.5) X 10 3 


(2.6 ±0.9) x 10 4 


(1.05 ±0.18) x 10 7 


1.49 ±0.19 


3.77 ±0.11 


State 2 


(1 ±3) x 10 3 


(2.4 ±0.9) x 10 4 


(4.1 ± 1.1) x 10 6 


1.5 ±0.3 


3.62 ±0.14 


State 3 


(5 ± 5) x 10 3 


(7 ± 3) x 10 4 


(7 ±5) x 10 7 


2.05 ±0.10 


4.8 ±0.3 


State 4 


(2 ± 5) x 10 3 


(4 ± 2) x 10 4 


(8.2 ± 1.7) x 10 6 


1.8 ±0.3 


4.11 ±0.13 


State 5 


(2 ± 5) x 10 3 


(4.5 ± 1.9) x 10 4 


(2.4 ±0.3) x 10 7 


1.7±0.3 


4.3 ±0.18 


State 6 


(4 ± 4) x 10 3 


(1.9 ±0.6) x 10 4 


(1.8 ±0.4) x 10 6 


1.54 ±0.11 


4.37 ±0.09 


State 7 


(1 ± 7) x 10 3 


(8 ± 6) x 10 4 


(7± 2) x 10 6 


1.7±0.4 


4.23 ±0.20 


State 8 


(4 ±9) x 10 1 


(5 ± 2) x 10 4 


(1.6 ±0.4) x 10 7 


1.5 ±0.2 


4.22 ±0.14 


State 9 


(1 ±7) x 10 2 


(8 ± 9) x 10 4 


(1.1 ±0.4) x 10 7 


1.6 ±0.5 


3.9 ±0.2 



drawn from these results, for which we refer the reader to Ref. [22 We will, instead, 
consider some elements relevant to the statistical analysis. 

First, reduced % 2 values are reasonable: a couple of cases (states 5 and 8) might 
require some additional check, as values are slightly low. Uncertainties in most cases 
allow to constrain parameters within physically meaningful and expected ranges. 
There are some exceptions, in which uncertainties tend to be larger than usually 
acceptable, like in the case of the magnetic field of state 3 or the blob radius of state 
5 and, for most states, the blob electron density. In the cases in which the uncer- 
tainties appear to be too high, it is important to remember that these uncertainties 
are obtained by considering a quadratic approximation to the \ 2 near the estimated 
minimum. This is a good approximation only when the uncertainties are relatively 
small, because we can not expect the \ 2 surface to behave as the quadratic approx- 
imation far from the minimum. It might also happen that, because of the nature 
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of the problem/model, the x 2 nas a much more flat minimum in the direction of 
some of the parameters. In all these cases the quadratic approximation might over- 
estimate the uncertainties and it would be preferable to use the criterion that gives 
the uncertainties as the absolute difference between the minimum value of each pa- 
rameter at the estimated minimum and the value of the parameter at which the 
X 2 has increased by one. This definition of the uncertainty in the parameters gives, 
in general, asymmetric error bars, which can be an additional desirable features in 
situation in which the uncertainties make parameters that should be positive, com- 
patible with zero or negative values. Apart from the above considerations, the fitted 
parameters appear compatible with what is expected for this source. It is however 
important to consider in more detail the statistical significance of these results by 
applying some goodness of fit test, which we will discuss in the following section. 

4.3. Statistical significance 

After obtaining the fit parameters, we can now proceed with the last step, i.e. 
discuss their statistical significance. To this end we will consider, for each of the 
nine fitted SEDs, the residuals of the fit, i.e. the differences between the observed 
points and the value of the fitted SED at the frequency of the observed points. We 
then apply the KS test to verify if the residuals are normally distributed (the ,jYq 
null- hypothesis of page IT8)) . Code for the calculation of the relevant statistics Dm 
(cf. Eq. (fit)])) is available, for instance in Ref. [M] In this study, however, we used 
the functions that are included in Mathcmatica® since version 8.0: the reason for 
this is the fact that these functions already implement a method for the Monte 
Carlo approach that we briefly mentioned at the very end of Sec. [3] on pag. [18] A 
first application of the KS test at the 5% confidence level, shows that the residuals 
are not normally distributed, i.e. the KS test fails. Following this result, we applied 
the KS test again, this time at the 10% confidence level: again the test failed in 
all cases, showing also at this confidence level that the residuals are not normally 
distributed 

Failure of the KS test shows that the statistical significance of the fits should 
be carefully re-considered. In this case it might be actually possible to explain the 
reason for this failur^l, but we will not proceed here in this direction. We will 
instead draw a bold conclusion and emphasize the fact that, in absence of other 
explanations, the failure of the KS test could already bring us to the conclusion 



'To have a more clear understanding of the situation, in each case we then decided to divide the 
data in two groups, data at low energy (i.e. within the Synchrotron region of the spectrum) and 
data at very high energy (i.e. within the Compton region of the spectrum). For each dataset we 
then calculated the residuals for the low-energy subset of the data and for the high energy subset 
of the data. We then applied to all these subsets the KS test (we called this procedure piecewise 
KS test, as for each dataset the KS test for is applied separately to low- and high-energy 
data). Surprisingly enough, in all these cases the KS test confirmed normality of the residuals. 
Further discussion of this point can be found elsewherePH For our present purpose this further 
analysis is not necessary and we will ignore it here. 
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that the model might require improvements: existing observations, even in presence 
of a successful fit with reasonable values for the parameters and for the reduced x 2 > 
could be enough to show the inadequacy of the model. We could have never reached 
this result, had we not proceeded through a rigorous statistical approach and we 
recognize, in this way, the extreme importance of an in depth statistical analysis to 
put to the best use our models and the related observations. 

5. Conclusion 

In this contribution we have discussed all the steps that are required to perform a 
rigorous statistical analysis on simultaneous multi-wavelength datasets of blazars. 
Although it is challenging to obtain good datasets, because observations are often 
made difficult by the necessity to have several instruments simultaneously available 
in absence of observational constraints, we find really exciting to think that such 
observations (which probe several order of magnitudes in the source spectrum with 
different instruments and techniques) could be already at a good enough level to 
allow us to discriminate between different emission models. The importance of a 
statistical approach is, indeed, two-sided. On one side it can force us to improve 
our models, to make them compatible with the data. On the other, it can help 
us to understand how to plan future instruments and observations and efficiently 
improve, where it is most needed, the quality and amount of the data. We have 
finally shown, in the specific case of Mrk421, how this analysis has been applied to 
a specific problem and how existing data could be already suggesting the need for 
refinements of the emission model for this source. 
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